Overview

Dataset statistics

Number of variables: 12
Number of observations: 48842
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 4.5 MiB
Average record size in memory: 96.0 B

Variable types

NUM: 8
CAT: 3
BOOL: 1

Reproduction

Analysis started: 2020-06-24 15:10:00.166747
Analysis finished: 2020-06-24 15:10:15.231039
Duration: 15.06 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Unnamed: 0 has unique values [Unique]
score has 13093 (26.8%) zeros [Zeros]
capital-gain has 44807 (91.7%) zeros [Zeros]
capital-loss has 46560 (95.3%) zeros [Zeros]

Variables

Unnamed: 0
Real number (ℝ≥0)

UNIQUE

Distinct count: 48842
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 24420.5
Minimum: 0
Maximum: 48841
Zeros: 1
Zeros (%): < 0.1%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 2442.05
Q1: 12210.25
median: 24420.5
Q3: 36630.75
95-th percentile: 46398.95
Maximum: 48841
Range: 48841
Interquartile range (IQR): 24420.5

Descriptive statistics

Standard deviation: 14099.61526
Coefficient of variation (CV): 0.5773680007
Kurtosis: -1.2
Mean: 24420.5
Median Absolute Deviation (MAD): 12210.5
Skewness: 0
Sum: 1192746061
Variance: 198799150.5

Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
2047                  1      < 0.1%
7497                  1      < 0.1%
34138                 1      < 0.1%
40281                 1      < 0.1%
38232                 1      < 0.1%
11599                 1      < 0.1%
9550                  1      < 0.1%
15693                 1      < 0.1%
13644                 1      < 0.1%
3403                  1      < 0.1%
Other values (48832)  48832  > 99.9%

Minimum 5 values

Value  Count  Frequency (%)
0      1      < 0.1%
1      1      < 0.1%
2      1      < 0.1%
3      1      < 0.1%
4      1      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
48841  1      < 0.1%
48840  1      < 0.1%
48839  1      < 0.1%
48838  1      < 0.1%
48837  1      < 0.1%

income
Boolean

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 381.6 KiB

Common values

Value  Count  Frequency (%)
0      37155  76.1%
1      11687  23.9%

score
Real number (ℝ≥0)

ZEROS

Distinct count: 437
Unique (%): 0.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.242394501902149
Minimum: 0.0
Maximum: 1.0
Zeros: 13093
Zeros (%): 26.8%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0.06
Q3: 0.37
95-th percentile: 0.98
Maximum: 1
Range: 1
Interquartile range (IQR): 0.37

Descriptive statistics

Standard deviation: 0.3289513128
Coefficient of variation (CV): 1.357090653
Kurtosis: -0.06091413434
Mean: 0.2423945019
Median Absolute Deviation (MAD): 0.06
Skewness: 1.216295538
Sum: 11839.03226
Variance: 0.1082089662

Histogram with fixed size bins (bins=10)

Common values

Value               Count  Frequency (%)
0                   13093  26.8%
0.01                3522   7.2%
0.02                2218   4.5%
0.03                1764   3.6%
1                   1722   3.5%
0.04                1488   3.0%
0.05                1221   2.5%
0.06                1064   2.2%
0.08                962    2.0%
0.07                923    1.9%
Other values (427)  20865  42.7%

Minimum 5 values

Value           Count  Frequency (%)
0               13093  26.8%
0.0025          5      < 0.1%
0.003333333333  4      < 0.1%
0.004           2      < 0.1%
0.005           7      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
1      1722   3.5%
0.99   465    1.0%
0.98   262    0.5%
0.97   225    0.5%
0.96   226    0.5%

gender
Categorical

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 381.6 KiB

Common values

Value   Count  Frequency (%)
Male    32650  66.8%
Female  16192  33.2%

Length

Max length: 6
Median length: 4
Mean length: 4.663035912
Min length: 4

race
Categorical

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 381.6 KiB

Common values

Value               Count  Frequency (%)
White               41762  85.5%
Black               4685   9.6%
Asian-Pac-Islander  1519   3.1%
Amer-Indian-Eskimo  470    1.0%
Other               406    0.8%

Length

Max length: 18
Median length: 5
Mean length: 5.529400925
Min length: 5

marital-status
Categorical

Distinct count: 7
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 381.6 KiB

Common values

Value                  Count  Frequency (%)
Married-civ-spouse     22379  45.8%
Never-married          16117  33.0%
Divorced               6633   13.6%
Separated              1530   3.1%
Widowed                1518   3.1%
Married-spouse-absent  628    1.3%
Married-AF-spouse      37     0.1%

Length

Max length: 21
Median length: 13
Mean length: 14.40604398
Min length: 7

age
Real number (ℝ≥0)

Distinct count: 74
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 38.64358543876172
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 17
5-th percentile: 19
Q1: 28
median: 37
Q3: 48
95-th percentile: 63
Maximum: 90
Range: 73
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 13.71050993
Coefficient of variation (CV): 0.35479394
Kurtosis: -0.1842687406
Mean: 38.64358544
Median Absolute Deviation (MAD): 10
Skewness: 0.5575803166
Sum: 1887430
Variance: 187.9780827

Histogram with fixed size bins (bins=10)

Common values

Value              Count  Frequency (%)
36                 1348   2.8%
35                 1337   2.7%
33                 1335   2.7%
23                 1329   2.7%
31                 1325   2.7%
34                 1303   2.7%
37                 1280   2.6%
28                 1280   2.6%
30                 1278   2.6%
38                 1264   2.6%
Other values (64)  35763  73.2%

Minimum 5 values

Value  Count  Frequency (%)
17     595    1.2%
18     862    1.8%
19     1053   2.2%
20     1113   2.3%
21     1096   2.2%

Maximum 5 values

Value  Count  Frequency (%)
90     55     0.1%
89     2      < 0.1%
88     6      < 0.1%
87     3      < 0.1%
86     1      < 0.1%

fnlwgt
Real number (ℝ≥0)

Distinct count: 28523
Unique (%): 58.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 189664.13459727284
Minimum: 12285
Maximum: 1490400
Zeros: 0
Zeros (%): 0.0%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 12285
5-th percentile: 39615.4
Q1: 117550.5
median: 178144.5
Q3: 237642
95-th percentile: 379481.65
Maximum: 1490400
Range: 1478115
Interquartile range (IQR): 120091.5

Descriptive statistics

Standard deviation: 105604.0254
Coefficient of variation (CV): 0.5567949135
Kurtosis: 6.057848212
Mean: 189664.1346
Median Absolute Deviation (MAD): 60295.5
Skewness: 1.438891879
Sum: 9263575662
Variance: 1.115221019e+10

Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
203488                21     < 0.1%
190290                19     < 0.1%
120277                19     < 0.1%
125892                18     < 0.1%
126569                18     < 0.1%
126675                17     < 0.1%
113364                17     < 0.1%
99185                 17     < 0.1%
186934                16     < 0.1%
111567                16     < 0.1%
Other values (28513)  48664  99.6%

Minimum 5 values

Value  Count  Frequency (%)
12285  1      < 0.1%
13492  1      < 0.1%
13769  3      < 0.1%
13862  1      < 0.1%
14878  1      < 0.1%

Maximum 5 values

Value    Count  Frequency (%)
1490400  1      < 0.1%
1484705  1      < 0.1%
1455435  1      < 0.1%
1366120  1      < 0.1%
1268339  1      < 0.1%

education-num
Real number (ℝ≥0)

Distinct count: 16
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.078088530363212
Minimum: 1
Maximum: 16
Zeros: 0
Zeros (%): 0.0%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 5
Q1: 9
median: 10
Q3: 12
95-th percentile: 14
Maximum: 16
Range: 15
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.570972756
Coefficient of variation (CV): 0.2551051966
Kurtosis: 0.6257452728
Mean: 10.07808853
Median Absolute Deviation (MAD): 1
Skewness: -0.3165248567
Sum: 492234
Variance: 6.60990091

Histogram with fixed size bins (bins=10)

Common values

Value             Count  Frequency (%)
9                 15784  32.3%
10                10878  22.3%
13                8025   16.4%
14                2657   5.4%
11                2061   4.2%
7                 1812   3.7%
12                1601   3.3%
6                 1389   2.8%
4                 955    2.0%
15                834    1.7%
Other values (6)  2846   5.8%

Minimum 5 values

Value  Count  Frequency (%)
1      83     0.2%
2      247    0.5%
3      509    1.0%
4      955    2.0%
5      756    1.5%

Maximum 5 values

Value  Count  Frequency (%)
16     594    1.2%
15     834    1.7%
14     2657   5.4%
13     8025   16.4%
12     1601   3.3%

capital-gain
Real number (ℝ≥0)

ZEROS

Distinct count: 123
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1079.0676262233324
Minimum: 0
Maximum: 99999
Zeros: 44807
Zeros (%): 91.7%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 7452.019058
Coefficient of variation (CV): 6.905979641
Kurtosis: 152.6930963
Mean: 1079.067626
Median Absolute Deviation (MAD): 0
Skewness: 11.894659
Sum: 52703821
Variance: 55532588.04

Histogram with fixed size bins (bins=10)

Common values

Value               Count  Frequency (%)
0                   44807  91.7%
15024               513    1.1%
7688                410    0.8%
7298                364    0.7%
99999               244    0.5%
3103                152    0.3%
5178                146    0.3%
5013                117    0.2%
4386                108    0.2%
8614                82     0.2%
Other values (113)  1899   3.9%

Minimum 5 values

Value  Count  Frequency (%)
0      44807  91.7%
114    8      < 0.1%
401    5      < 0.1%
594    52     0.1%
914    10     < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
99999  244    0.5%
41310  3      < 0.1%
34095  6      < 0.1%
27828  58     0.1%
25236  14     < 0.1%

capital-loss
Real number (ℝ≥0)

ZEROS

Distinct count: 99
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 87.50231358257237
Minimum: 0
Maximum: 4356
Zeros: 46560
Zeros (%): 95.3%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 403.0045521
Coefficient of variation (CV): 4.605644532
Kurtosis: 20.01434595
Mean: 87.50231358
Median Absolute Deviation (MAD): 0
Skewness: 4.569808858
Sum: 4273788
Variance: 162412.669

Histogram with fixed size bins (bins=10)

Common values

Value              Count  Frequency (%)
0                  46560  95.3%
1902               304    0.6%
1977               253    0.5%
1887               233    0.5%
2415               72     0.1%
1485               71     0.1%
1848               67     0.1%
1590               62     0.1%
1602               62     0.1%
1876               59     0.1%
Other values (89)  1099   2.3%

Minimum 5 values

Value  Count  Frequency (%)
0      46560  95.3%
155    1      < 0.1%
213    5      < 0.1%
323    5      < 0.1%
419    3      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
4356   3      < 0.1%
3900   2      < 0.1%
3770   4      < 0.1%
3683   2      < 0.1%
3175   2      < 0.1%

hours-per-week
Real number (ℝ≥0)

Distinct count: 96
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40.422382375824085
Minimum: 1
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Memory size: 381.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 17.05
Q1: 40
median: 40
Q3: 45
95-th percentile: 60
Maximum: 99
Range: 98
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 12.39144402
Coefficient of variation (CV): 0.3065490774
Kurtosis: 2.95105909
Mean: 40.42238238
Median Absolute Deviation (MAD): 3
Skewness: 0.2387496572
Sum: 1974310
Variance: 153.547885

Histogram with fixed size bins (bins=10)

Common values

Value              Count  Frequency (%)
40                 22803  46.7%
50                 4246   8.7%
45                 2717   5.6%
60                 2177   4.5%
35                 1937   4.0%
20                 1862   3.8%
30                 1700   3.5%
55                 1051   2.2%
25                 958    2.0%
48                 770    1.6%
Other values (86)  8621   17.7%

Minimum 5 values

Value  Count  Frequency (%)
1      27     0.1%
2      53     0.1%
3      59     0.1%
4      84     0.2%
5      95     0.2%

Maximum 5 values

Value  Count  Frequency (%)
99     137    0.3%
98     14     < 0.1%
97     2      < 0.1%
96     9      < 0.1%
95     2      < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, so for a linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
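
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code the report itself runs; the function name `pearson_r` is an assumption made for this example, and the sample values are the age and hours-per-week entries from the first five sample rows of this report.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# age and hours-per-week from the first five sample rows
ages = [25, 38, 28, 44, 18]
hours = [40, 50, 40, 40, 30]
r = pearson_r(ages, hours)

# A change of location and (positive) scale in one variable leaves r unchanged.
r_scaled = pearson_r([2 * a + 10 for a in ages], hours)
print(r, r_scaled)
```

Five rows are far too few to say anything about the dataset; the values only make the arithmetic concrete.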

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
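
As a rough sketch of that definition, the snippet below ranks each variable and then applies the covariance-over-standard-deviations ratio to the ranks. The helper names `average_ranks` and `spearman_rho` are assumptions for this example, and giving tied values the average of their positions is one common convention that the text above does not specify.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson-style ratio computed on the rank variables of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_rx = sum(rx) / n
    mean_ry = sum(ry) / n
    cov = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry)) / n
    var_rx = sum((a - mean_rx) ** 2 for a in rx) / n
    var_ry = sum((b - mean_ry) ** 2 for b in ry) / n
    if var_rx == 0 or var_ry == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / math.sqrt(var_rx * var_ry)

# Hypothetical data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)
print(rho)
```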

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
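
The pair-counting rule can be written down directly. The sketch below implements the formula exactly as stated (often called tau-a), treating tied pairs as neither concordant nor discordant; the function name `kendall_tau_a` is an assumption for this example, and statistical libraries frequently use tie-adjusted variants, so values on data with many ties may differ from this simple version.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
        # direction == 0 means a tie in x or y: counted as neither
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical data: two concordant pairs and one discordant pair.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
print(tau)
```

The double loop over pairs is quadratic in the number of observations, which is fine for illustration but slow for a dataset the size of this one.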

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
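
To make the bias correction concrete, here is a minimal sketch of a bias-corrected Cramér's V in the form usually attributed to Bergsma (2013): the chi-squared statistic is divided by the sample size, a correction term is subtracted, and the row and column counts are shrunk before taking the square root. The function name `cramers_v_corrected` and the toy data are assumptions for this example, and the report's own implementation may differ in details such as continuity corrections.

```python
import math
from collections import Counter

def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of category labels."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    joint = Counter(zip(xs, ys))      # observed cell counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic, without continuity correction.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: subtract the expected value of phi2 under independence.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(phi2_corr / denom)

# Hypothetical labels: the second variable is fully determined by the first.
perfect = cramers_v_corrected(["a", "b"] * 50, ["x", "y"] * 50)
# Hypothetical labels: every combination occurs equally often.
independent = cramers_v_corrected(["a", "a", "b", "b"] * 25, ["x", "y", "x", "y"] * 25)
print(perfect, independent)
```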

Missing values

Sample

First rows

(index)  Unnamed: 0  income  score  gender  race   marital-status      age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
0        0           0       0.00   Male    Black  Never-married       25   226802  7              0             0             40
1        1           0       0.22   Male    White  Married-civ-spouse  38   89814   9              0             0             50
2        2           1       0.95   Male    White  Married-civ-spouse  28   336951  12             0             0             40
3        3           1       1.00   Male    Black  Married-civ-spouse  44   160323  10             7688          0             40
4        4           0       0.00   Female  White  Never-married       18   103497  10             0             0             30
5        5           0       0.00   Male    White  Never-married       34   198693  6              0             0             30
6        6           0       0.00   Male    Black  Never-married       29   227026  9              0             0             40
7        7           1       0.44   Male    White  Married-civ-spouse  63   104626  15             3103          0             32
8        8           0       0.00   Female  White  Never-married       24   369667  10             0             0             40
9        9           0       0.01   Male    White  Married-civ-spouse  55   104996  4              0             0             10

Last rows

(index)  Unnamed: 0  income  score  gender  race                marital-status      age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
48832    48832       0       0.01   Male    Amer-Indian-Eskimo  Married-civ-spouse  32   34066   6              0             0             40
48833    48833       0       0.26   Male    White               Married-civ-spouse  43   84661   11             0             0             45
48834    48834       0       0.04   Male    Asian-Pac-Islander  Never-married       32   116138  14             0             0             11
48835    48835       1       0.76   Male    White               Married-civ-spouse  53   321865  14             0             0             40
48836    48836       0       0.00   Male    White               Never-married       22   310152  10             0             0             40
48837    48837       0       0.04   Female  White               Married-civ-spouse  27   257302  12             0             0             38
48838    48838       1       0.73   Male    White               Married-civ-spouse  40   154374  9              0             0             40
48839    48839       0       0.03   Female  White               Widowed             58   151910  9              0             0             40
48840    48840       0       0.00   Male    White               Never-married       22   201490  9              0             0             20
48841    48841       1       1.00   Female  White               Married-civ-spouse  52   287927  9              15024         0             40